LiteRT Next is a new set of APIs that improves upon LiteRT, particularly in terms of hardware acceleration and performance for on-device ML and AI applications. The APIs are an alpha release and available in Kotlin and C++.
The LiteRT Next Compiled Model API builds on the TensorFlow Lite Interpreter API and simplifies model loading and execution for on-device machine learning. The new APIs provide a streamlined way to use hardware acceleration, removing the need to deal with model FlatBuffers, I/O buffer interoperability, and delegates. The LiteRT Next APIs are not compatible with the LiteRT APIs. To use features from LiteRT Next, see the Get Started guide.
For example implementations of LiteRT Next, refer to the demo applications.
Quickstart
Running inference with the LiteRT Next APIs involves the following key steps:
- Load a compatible model.
- Allocate the input and output tensor buffers.
- Invoke the compiled model.
- Read the inference results from the output buffer.
The following code snippets show a basic implementation of the entire process in Kotlin and C++.
C++
// Load model and initialize runtime
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, kLiteRtHwAcceleratorCpu));

// Preallocate input/output buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Fill the first input (placeholder values; replace with your data)
float input_values[] = {1.0f, 2.0f, 3.0f, 4.0f};
input_buffers[0].Write<float>(
    absl::MakeConstSpan(input_values, std::size(input_values)));

// Invoke
compiled_model.Run(input_buffers, output_buffers);

// Read the output
// (output_data_size is the element count of the model's output tensor)
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));
Kotlin
// Load model and initialize runtime
val model =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.CPU)
    )

// Preallocate input/output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// Fill the first input (placeholder values; replace with your data)
inputBuffers[0].writeFloat(floatArrayOf(1f, 2f, 3f, 4f))

// Invoke
model.run(inputBuffers, outputBuffers)

// Read the output
val outputFloatArray = outputBuffers[0].readFloat()
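Because the buffers are preallocated once, they can be reused across invocations. The following Kotlin sketch shows one way to do that; classifyFrames is a hypothetical helper, and it assumes the model, inputBuffers, and outputBuffers from the snippet above are in scope.

// Hypothetical helper: reuse the preallocated buffers across many inferences.
// Assumes `model`, `inputBuffers`, and `outputBuffers` from the snippet above.
fun classifyFrames(frames: List<FloatArray>): List<FloatArray> =
    frames.map { frame ->
        inputBuffers[0].writeFloat(frame)   // overwrite the input in place
        model.run(inputBuffers, outputBuffers)
        outputBuffers[0].readFloat()        // copy out this frame's results
    }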
For more information, see the Get Started with Kotlin and Get Started with C++ guides.
Key features
LiteRT Next contains the following key benefits and features:
- New LiteRT API: Streamline development with automated accelerator selection, true async execution, and efficient I/O buffer handling (see the accelerator-selection sketch after this list).
- Best-in-class GPU Performance: Use state-of-the-art GPU acceleration for on-device ML. The new buffer interoperability enables zero-copy and minimizes latency across various GPU buffer types.
- Superior Generative AI inference: Enable the simplest integration with the best performance for GenAI models.
- Unified NPU Acceleration: Offer seamless access to NPUs from major chipset providers with a consistent developer experience. LiteRT NPU acceleration is available through an Early Access Program.
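As a sketch of accelerator selection, the Kotlin snippet below requests GPU execution by changing only the accelerator passed to CompiledModel.Options. The Accelerator.GPU constant is an assumption mirroring the Accelerator.CPU value from the quickstart, not a verbatim excerpt from the API.

// Sketch: selecting an accelerator at model compilation time.
// Assumes Accelerator.GPU is available, mirroring Accelerator.CPU
// from the quickstart above.
val gpuModel =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.GPU)
    )

Everything else in the flow, buffer creation, invocation, and reading results, stays identical to the CPU path shown in the quickstart.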
Key improvements
LiteRT Next (Compiled Model API) contains the following key improvements over LiteRT (TFLite Interpreter API). For a comprehensive guide to setting up your application with LiteRT Next, see the Get Started guide.
- Accelerator usage: Running models on the GPU with LiteRT requires explicit delegate creation, function calls, and graph modifications. With LiteRT Next, you just specify the accelerator (contrasted in the sketch after this list).
- Native hardware buffer interoperability: LiteRT does not provide options for hardware buffers and forces all data through CPU memory. With LiteRT Next, you can pass in Android Hardware Buffers (AHWB), OpenCL buffers, OpenGL buffers, or other specialized buffers.
- Async execution: LiteRT Next comes with a redesigned async API, providing a true async mechanism based on sync fences. This enables faster overall execution times by using diverse hardware (CPUs, GPUs, and NPUs) for different tasks.
- Model loading: LiteRT Next does not require a separate builder step when loading a model.
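To make the accelerator-usage and model-loading improvements concrete, here is a hedged Kotlin comparison. The "before" half uses the classic TensorFlow Lite Interpreter API (Interpreter, Interpreter.Options, GpuDelegate); the "after" half uses the Compiled Model API from the quickstart. loadModelFile is a hypothetical helper that maps the .tflite file into a ByteBuffer, and Accelerator.GPU is the same assumption noted earlier.

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate

// Before (LiteRT / TFLite Interpreter API): explicit delegate plumbing.
// loadModelFile is a hypothetical helper returning a MappedByteBuffer.
val delegate = GpuDelegate()
val options = Interpreter.Options().addDelegate(delegate)
val interpreter = Interpreter(loadModelFile(context, "mymodel.tflite"), options)

// After (LiteRT Next / Compiled Model API): one call, just name the accelerator.
val model =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.GPU)
    )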